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Abstract 

Variations in DNA copy number carries information on the modalities of genome evolution 
and misregulation of DNA replication in cancer cells; their study can be helpful to localize 
tumor suppressor genes, distinguish different populations of cancerous cell, as well identify 
genomic variations responsible for disease phenotypes. A number of different high throughput 
technologies can be used to identify copy number variable sites, and the literature documents 
multiple effective algorithms. We focus here on the specific problem of detecting regions 
where variation in copy number is relatively common in the sample at hand: this encompasses 
the cases of copy number polymorphisms, related samples, technical replicates, and cancerous 
sub-populations from the same individual. We present an algorithm based on regularization 
approaches with significant computational advantages and competitive accuracy. We illustrate 
its applicability with simulated and real data sets. 
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1 Introduction 



Duplication and deletion of genomic materials are common in cancer cells and known to play a 
role in the establishment of the tumor status [|24|. As our ability to survey the fine scale of the 
human genome has increased, it has become apparent that normal cells can also harbor a number 
of variations in copy number ||T8l |36l . The last few years have witnessed a steady increase in 
our knowledge of size and frequency of these variants IfTOl [T9l |23l |30l and their implications in 
complex diseases [|29ll40l . 

At the same time, statistical methods and algorithms have been developed to better harness 
the information available. At the cost of oversimplification, two different approaches have become 
particularly popular: one is based on the hidden Markov model (HMM) machinery and explicitly 
aims to reconstruct the unobservable discrete DNA copy number; the other, which we will gener- 
ically call "segmentation", aims at identifying portions of the genome that have constant copy 
number, without specifically reconstructing it. The HMM approach takes advantage of the implic- 
itly discrete nature of the copy number process (both when a finite number of states is assumed 
and when, as in some implementations, less parametric approaches are adopted); furthermore, by 
careful modeling of the emission probabilities, one can fully utilize the information derived from 
the experimental results. In the case of genotyping arrays, for example, both the quantification of 
total DNA amount and relative allelic abundance as well as prior information (for example, mi- 
nor allele frequencies) can be considered. No a-priori knowledge of the number of copy number 
states is required the segmentation approach — an advantage in the study of cancer where poly- 
ploids and contamination with normal tissues result in a wide range of fractional copy numbers. 
Possibly for the reasons outlined, HMMs are the methods of choice in the analysis of normal sam- 
ples flU [34] BH H71 |49l, while segmentation methods are the standard in cancer studies ll26l l53l . 
A limitation of segmentation methods is that they rely on the data in which the variation in copy 
number is reflected in the differences in means of the segments — which make them applicable di- 
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rectly to a substantial portion of the data derived from recent technologies, but not to relative allelic 
abundance (see the modification suggested in [39] and following description for an exception). 

While a number of successful approaches have been derived along the lines described above, 
there is still a paucity of methodology for the joint analysis of multiple sequences. It is clear that 
if multiple subjects share the same variation in copy number, there exists the potential to increase 
power by joint analysis. Wang et al. (2009) ll45l presented a methodology that extended [|24| to 
reconstruct the location of tumor suppressor genes from the identification of regions lost in a larger 
number of samples; the initial steps of the Birdsuite algorithm rely on the identification of suspect 
signals in the context of multiple samples; PennCNV iPTTll includes an option of joint analysis 
of trios; methodology to process multiple samples with the context of change point analysis has 
been developed in ||37ll5Tll53l ; Efron and Zhang (201 1) lPT4ll consider FDR analysis of independent 
samples to identify copy number polymorphysms (CNPs); and Nowak et al. (201 1) E51l use a latent 
feature model to capture, in joint analysis of array-CGH data from multiple tumor samples, shared 
copy number profiles, on each of which a fused-lasso penalty is enforced for sparsity. In the present 
work we consider a setting similar to [|53l in that we want joint analysis to inform the segmentation 
of multiple samples. Our main focus is the analysis of genotyping array data, but the methodology 
we develop is applicable to a variety of platforms. By adopting a flexible framework we are able, 
for example, to define a segmentation algorithm that uses all information from Illumina genotyping 
data. As in 113711 . we are interested in the situation when not all the samples under consideration 
carry a copy number variant (CNV): we rather want to enforce a certain sparsity in the vector that 
identifies which samples carry a given variant. We tackle this problem using a penalized estimation 
approach, originally proposed in this context by [42], on which we have developed an algorithmic 
implementation before 11541 . Appreciable results are achieved in terms of speed, accuracy and 
flexibility. In concluding this introduction, we would like to make an important qualification: the 
focus of our contribution is on segmentation methods, knowing that this is only one of the steps 
necessary for an effective recovery of CNVs. In particular, normalization and transformation of 
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the signal from experimental sources are crucial and can have a very substantial impact on final 
results: we refer the reader to IB IH d [12l [33l [35J , for example. Furthermore, calling procedures 
that further classify results of segmentation while possibly controlling global error measures lfl4l 
are also needed. Indeed, in the data analysis included in this paper we need to resort to both these 
additional steps and we will describe briefly the fairly standard choices we are making. 

The rest of the paper is organized as follows: Section [2] motivates the need for joint analysis 
of multiple signals and presents the penalized estimation framework. Section [3] describes how 
the model can be used for data analysis by (a) outlining an efficient estimation algorithm, (b) 
generalizing it to the case of uncoordinated data, and (c) describing the choice of the penalization 
parameters. Section [4] illustrates our results on two simulated data sets (descriptive of normal and 
tumor samples) and two real data sets: in one case multiple platforms are used to analyze the same 
sample and in the other case samples from related individuals benefit from joint analysis. 

2 Multiple sequence segmentation 

The goal of the present paper is to develop a flexible methodology for joint segmentation of mul- 
tiple sequences that are presumed to carry related information on CNVs. We start by illustrating a 
series of contexts where the joint analysis appears to be useful. 

2.1 Motivation 

2.1.1 Genotyping arrays and CNV detection 

Genotyping arrays have been used on hundreds of thousands of subjects and the data collected 
through them provides an extraordinary resource for CNV detection and the study of their frequen- 
cies in multiple populations. Typically, the raw intensity data representing hybridization strength 
is processed to obtain two signals: a quantification of total DNA amount (from now on log R 
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Ratio LRR, following Illumina terminology) and a relative abundance of the two queried alleles 
(from now on B allele frequency, BAF). Both these signals contain information on CNV and one 
of the strengths of HMM models has been that they can easily process them jointly. Segmenta- 
tion models like CBS have traditionally relied only on LRR. While this is a reasonable choice, 
it can lead to substantial loss of information, particularly in tumor cells, where poliploidity and 
contamination make information in LRR hard to decipher. To exploit BAF in the context of a 
segmentation method, a signal transformation has been suggested lf39ll : mirrowed BAF (mBAF) 
relies on exchangeability of the two alleles and the low information content of homozygous SNPs. 
The resulting mBAF is defined on a coarser grid than the original BAF, but is characterized by 
changing means in presence of CNV. While Il39ll shows that its analysis alone can be advantageous 
and more powerful than segmentation of LRR in some contexts, clearly a joint analysis of LRR 
and mBAF should be preferable to an arbitrary selection of one or the other signal. 

2.1.2 Multiple platforms 

LRR and BAF are just one example of the multiple signals that one can have available for the same 
sample. Often, as research progresses, the samples are assessed with a variety of technologies. 
For example, a number of subjects that have been genotyped at high resolution are now being 
resequenced. Whenever the technology adopted generates a signal that contains some information 
on copy number, there is an incentive to analyze the available signals jointly. 

2.1.3 Tumor samples from the same patient obtained at different sites or different progres- 
sion stages 

In an effort to identify mutations that are driving a specific tumor, as well as study its response 
to treatment, researchers might want to study CNVs in cells obtained at different tumor sites or 
at different time points ll27Tl . Copy number is highly dynamic in cancer cells, so that it is to be 
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expected that some differences be detected over time or across sites. In contrast, the presence of 
the same CNVs across these samples, can be taken as an indication that the tumors share the same 
origin: therefore a comparative analysis of CNV can be used to distinguish resurgence of the same 
cancer from insurgence of a new one, or to identify specific cancer cell populations. Given that 
the tissue extracted always consists of a mixture of normal and cancer cells, which are in turn a 
mixture of different populations, joint analysis of the signals from the varied materials is much 
more likely to lead to the identification of common CNVs, when these exist. 

2.1.4 Related subjects 

Family data is crucial in genetic investigations and hence it is common to analyze related subjects. 
When studying individuals from the same pedigree, it is reasonable to assume that some CNVs 
might be segregating in multiple people: joint analysis would reduce Mendelian errors and increase 
power of detection. 

2.2 A model for joint analysis of multiple signals 

Assume we have observed M signals, each measured at N locations, corresponding to ordered 
physical positions along the genome, with being the observed value of sequence % at location j. 
The copy number process can be modeled as 

Vij = Pij +€ij, (1) 

where represent noise, and the mean values are piece-wise constant: there exists a linearly 
ordered partition {R^\ R%\ • • • , RrJ °f me location index {1,2,..., N} such that f3 is = ■ ■ ■ = 
Pa = for s, . . . , t e R$ and 1 < k < K{. In other words, most of the increments \/3ij — Aj-i I 
are assumed to be zero. When two sequences k and / share a CNV with the same boundaries 
at location j, both \/3 kj — Pk,j-i\ an d |Aj — w ^ ^ e different from zero in correspondence 

of the change point. Modulo an appropriate signal normalization, ^ = can be interpreted 
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as corresponding to the appropriate normal copy number equal to 2. We propose to reconstruct 
the mean values (3 by minimizing the following function, called hereafter generalized fused lasso 
(GFL): 



M N M N M N N 

i=l j=l i=l j=l i=l j=2 j=2 



' M 

i=i 

(2) 



which includes a goodness-of-fit term and three penalties, whose roles we will explain one at the 
time. The t\ penalty J2i=x X^=i I Ail enforces sparsity within f3, in favor of values = 0, cor- 
responding to the normal copy number. The total variation penalty J2f=2 IAj — Aj-i| minimizes 
the number of jumps in the piece-wise constant means of each sequence and was introduced by 



42J in the context of CNV reconstruction from array-CGH data. Finally, the Euclidean penalty on 



the column vector of jumps y YlfLiiPij ~ Aj-i) 2 is a form of the group penalty introduced by 
ll50ll and favors common jumps across sequences. As clearly explained in 11561 . "the local penalty 
around for each member of a group relaxes as soon as the — for one member % of the 

group moves off 0." Bleakley and Vert (201 1) |4) also suggested the use of this group-fused-lasso 
penalty to reconstruct CNV. We here consider the use of both the total variation and the Euclidean 
penalty on the jumps to achieve the equivalent effect of the sparse group lasso, which, as pointed 
out in |fT6l , favors CNV detection in multiple samples, allowing for sparsity in the vector indicat- 
ing which subjects are carriers of the variant. This property is important in situations as presented 



in Section [2. 1 .3| and [2. 1 A\ where one does not want to assume that all the M sequences carry the 
same CNV. 

The incorporation of the latter two penalties can also be naturally interpreted in view of image 
denoising. To restore an image disturbed by random noise while preserving sharp edges of items in 
the image, a 2-D total variation penalty A Yjf=x Y^=2 I Aj _ Aj-i I + P T,f=i X^= 2 I Pa ~ ft-ij I is 



proposed in a regularized least-square optimization B21 . where /3y- is the true underlying intensity 
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of pixel In CNV detection problems, signals from multiple sequences can be aligned up in 

shape of an image, except that pixels in each sequence are linearly ordered while sequences as a 
group have no certain order a priori; thus one of the two total variation penalties is replaced by the 
group penalty on the column vector of jumps. 

Using matrix notation, and allowing the tuning parameter Ai, A2 and A3 to be sequence spe- 
cific, we can reformulate the objective function as follows. Let Y = (y^) M xn and (3 = (/% )mxAt- 
Let f3 { be the ith row of (3 and (3^ the jth column of (3. Also, let A 3 = {\s,i)Mxi- Then we have 

1 - 
f((3) = -||Y-/3||| + ^A M ||/3J| £l 

i=i 

M N 

+ A 2 ,il \(3 ia . N - PiMN-Y) 1 W + G%) -%-i))lk> (3) 

i=l j=2 

where 1 1 ■ | \ F is the Frobenius norm of matrix, 1 1 ■ | \ tl and 1 1 ■ | \i 2 are l\ and £ 2 norm of vector, f3 { s:t 
indicates the sub- vector with elements /3 iiS , . . . , (3 ijt in row vector f3 { , and "*" is used as entry-wise 
multiplication between two vectors. Note that it would be easy to modify the tuning parameters 
so as to make them location specific: that is, reduce the penalty for a jump in correspondence of 
genomic regions known to harbor CNVs. 

3 Implementation 

3.1 An MM algorithm 

While the solution to the optimization problem ([3]) might have interesting properties, this approach 
is useful only if an effective algorithm is available. The last few years have witnessed substantial 
advances in computational methods for £i-regularization problems, including the use of coordinate 
descent ITT51I481 and path following methods (H[T71|43l|55l. The time cost of these methods in the 
best situation is O(MNK), for K knots along the solution path. It is important to note that these 
algorithms - some of which are designed for more general applications - may not be the most 
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efficient for large scale CNV analysis for at least two reasons: on the one hand, reasonable choices 
of A might be available, making it unnecessary to solve for the entire path; on the other hand, the 
number of knots K can be expected to be as large as O(N), making the computational costs of 
path algorithms prohibitive. 

With specific regard to the fused-lasso application to CNV detection, we were successful in 
developing algorithm with per iteration cost O(N) and empirically fast convergence rate for the 
analysis of one sequence ll54ll . We apply the same principles here. We start by modifying the 
norms in the penalty as follows: rather than the l\ norm we use | \x\ \2, e = Vx 2 + e for sufficiently 
small e, and, for computational stability, we also substitute £ 2 norm with | |x| | 2)£ = (X)£=i x \ + e ) 5 > 
obtaining a differentiable objective function 

1 M N M N 

i=l j=l i=l j=l 

M N N 

+ E A ^ E 11^ - ftj-ilk« + E H A 3 * - %-i))lk- ( 4 ) 

i=l j=2 j=2 

Adopting an MM framework EH . we want to find a surrogate function g 6 (f3 \ fi^) for each 
iteration m such that g t ((3 (m) | /3 (m) ) = f t {(3 (m) ) and g e (/3 \ (3 {m) ) > f e ((3) for all f3. At each 
iteration, then, f3 ( ' = argmmg e majorizing function with the above properties 

is readily obtained using the concavity of square-root function ||x||2, e < 2 \\z\\ 2 ^ ~ z2 )' an ^ * ts 
vector equivalent ||x||2, £ < 2 ||z^| 2 (ll x lli 2 — ll z ll? 2 )- The resulting 



M N M N 02 

g,(P I p (m) ) = ^EEte-^ + Ev^-JI 

- ■ • i 2\\p r 



i=l j=\ i=l j=l ^WPij l|2,e 

+ f A f (Aj ~ a„--i) 2 f HAa * (gg - /%-!)) Ill (ro) 
tT ^ 211/3^-/3^11^ ^2||A 3 *(/3| j "; ) -^ 1) )|k e 



can be decomposed in the sum of similar functions of all the row vectors /3 i 

M 

9*{P\fi m) ) = X>(/9, | /9 (m) ), 



i=i 



where 

gm\p [m) ) = IfrAt^j-ibtYpJ+t^ (5) 

Here each is a tridiagonal symmetric matrix, and c[ m ^ is irrelevant constant for opti- 
mization purpose. In view of the strict convexity of the surrogate function, each A^ is also 
positive definite. The nonzero entries of A-"^ and b-"^ (z = 1, . . . , M) are listed in the sup- 
plementary material. Each of the surrogate functions in ([5]) can be minimized solving the linear 
system f3 i = [fi^Y [A^y 1 by the Tri-diagonal Matrix (TDM) algorithm JTT). This results in 
a per-iteraction computational cost of 0(MN). This algorithm is empirically observed to achieve 
an exponential convergence rate [54J, although we do not yet have an analytic proof. In practice, 
this method scales well with joint analysis of tens to hundreds of samples with measurements at 
millions of locations, with limitations dictated by memory requirements. For analysis of real data, 
we suggest one or a group of samples to be analyzed chromosome by chromosome, since a CNV 
region can never extend beyond one chromosome to another. Actual computation times are shown 
along with different examples in Section |4j 

3.2 Stacking observations at different genomic locations 

While copy number is continuously defined across the genome, experimental procedures record 
data at discrete positions, for which we have used the indexes j = 1, . . . , N. In reality, repeated 
evaluations of the same sample (or related samples) will typically result in measurements at only 
partially overlapping genomic locations: either because different platforms use different sets of 
probes, or because missing data my occur at different positions across sequences (consider for 
example, mBAF and LRR from the same experiment on one subject: the mBAF signal will be 
defined on a subset of the locations where LRR is). 

Let S indicate the union of all genomic positions where some measurement is available among 
the M signals under study. And let Si be the subset of locations with measurements in sequence 
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i. We reconstruct for all j G 5. When j ^ Si, /3y will be determined simply on the basis of 



The MM strategy previously described applies with slight modifications of the matrix A\ (see 
the supplementary material). 

The attentive reader would have noted that y,ij with j £ Si can be considered as missing 
data, and an evaluation of the characteristics of this missingness is appropriate. In general, 
cannot be considered missing at random. The most important example is the case of mBAF, where 
homozygous markers result in missing values. Now, homozygosity is more common when copy 
number is equal to 1 than when copy number is equal to 2 and, therefore, there is potentially more 
information on to be extracted from the signals than the one we will capture with the proposed 
methodology. On the other hand, it does appear that the approach outlined does not increase false 
positive: operationally, then, it can be considered as an improvement over segmentation based on 
LRR only, even if in theory, it does not completely use the information on BAR It is also relevant 
to note that, in reality, most of the information on deletion is obtained through LRR, and BAF 
is really carrying additional information in case of duplications (where the changes in LRR are 
limited due to saturation effects). 

3.3 Choice of tuning constants and segmentation 

One of the limitations of penalization procedures is that a value for the tuning parameters needs 
to be set and clear guidelines are not always available. Path methods that obtain a solution of 
the optimization problem for every value of tuning parameters can be attractive, but recent 
algorithmic advances (H|43l|55l remain impractical for problems of the size of ours. A number 



the neighboring datapoints, relying on the regularizations introduced in ([3]). The goodness-of-fit 



portion of the objective function is therefore redefined as 




(6) 
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of recent publications obtain optimal values of penalty parameters under a series of conditions 
|3l|5l|6l[T3l: we rely upon them to propose the following strategy consisting of obtaining a solution 
of ([3]) for reasonably liberal values of the tuning parameters, followed by a sequence-by- sequence 
hard thresholding of the detected jumps with a data-adaptive threshold. 

We have found the following guidelines to be useful in choosing penalty parameter values: 

Al,j = Ci<7j, 

A 2 ,» = p{p)c20i A/log N, (7) 
A 3 ,i = [1- p(p)]c 3 cr iy /p~My/logN, 

for i = 1, . . . , M, where &i is a robust estimate of standard deviation of y i9 p is roughly the pro- 
portion of the M sequences we anticipate to carry CNVs, and c\, C2 and c 3 are positive multipliers 
adjusted in consideration of different signal-to-noise ratios and CNV sizes. 

While a more rigorous justification is provided in the supplementary material, we start by 
underscoring some of the characteristics of this proposal. 

• The sequence-specific penalizing parameters are proportional to an estimate of the standard 
deviation of the sequence signal: that is, proviso an initial normalization, the same penalties 
would be used across all signals. 

• The tuning parameter for the total variation (fused lasso) and the Euclidean (group fused 
lasso) penalties on the jumps depend on y/\og~N, where N is the possible number of jumps. 
This has a "multiple comparison controlling" effect and resembles rates that have been 
proven optimal under various sparse scenarios ESilEl- This term does not appear in 
the expression of Ai, as the lasso penalty can be understood as providing a soft thresholding 
of the solution of p]) when Ai = 0: given the penalization due to A2 and A3, this object will 
have much smaller dimensionality than N. 
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• The group penalty depends on v M, where M is the number of grouped sequences, as in the 
original proposal ll50l . 

• The relative weight of the fused-lasso and group-fused-lasso penalties is regulated by p, 
which depends on p, the proportion of the M sequences expected to carry the same CNV. 
For example, if M = 2 and the two sequences are LRR and BAF from the same individual, 
we anticipate p = 1 with p = 0, enforcing jumps at identical places in the two signals. At 
the other extreme, for completely unrelated sequences, p = and p = 1. 

The standard deviation &i can be estimated robustly as follows. Let Ay = yij+i — i/ij, 
for j = 1, . . . , N — 1, be the one-order difference of adjacent yy for sequence i. Then most 
Var(Ay) = 2of except those bridging real change points, so we can take 

&i = SD(Ai)/V2, 

where SD(Ai) = Standard Deviation(Aj) or SD(Ai) = Median Absolute Deiviation(Aj) for 
Aj = {Aj i, . . . , A^at-i}. 

As mentioned before, the exact values of the penalty parameters should be adjusted depending 
on the expectations of signal strengths. Following the approach in OTI , one can approximate the 
bias induced by each of the penalties and hence work backwards in terms of acceptable levels. As 
detailed in the supplementary material, 

Bias(Ai) ps Ai 

Bias(A 2 ) ~ A 2 /Length of segment 

Bias (A3) ~ A3/ (Length of segment x a/# sequences sharing segment) 

Following again the approach in [3 1 ], one can show that under some relatively strong assump- 
tions, the choices in ([7]) lead to a consistent behavior as iV — >■ 00 and M stays bounded (see the 
supplementary material). Despite the fact that N is indeed large in our studies, it is not clear that 
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we can assume it to be in the asymptotic regime. As finer scale measurements become available, 
scientists desire to investigate CNV of decreasing length: the CNVs we are interested in discover- 
ing are often covered by a small number of probes. Furthermore we have often little information 
on the sizes and frequencies of CNV. In this context, we find it advisable to rely on a two-stage 
strategy: 

1. Sequences are jointly segmented minimizing ([3]) for a relatively lax choice of the penalty 
parameters. 

2. Jumps are further thresholded on the basis of a data-driven cut-off. 

Step 2 allows us to be adaptive to the signal strength and can be carried on with multiple methods. 
For example, one can adopt the modified Bayesian Information Criteria (mBIC) [|52l . For sequence 
i, the jumps are sorted as {d^, . . . , di( N -i)} in the descending order of their absolute values. And 
then we choose the first k change points where k is given by 

k = argmax fe mBIC(A;). 

In data analysis, we often apply an even simpler procedure where the threshold for jumps is defined 
as a fraction of the maximal jump size observed for every sequence. Specifically, for sequence i, 
let t>i = max 2 <j<N{\dij\}, where d^ = — Aj-i, be the largest observed jump for sequence i. 
Then we define 

7, = max{a<7j, min{Z)j, b&i}}, for a < b, 

as a "ruler" reflecting the scale of a possible real jump size, taking cji as the cut-off in removal 
of most small jumps. In all analyses for this paper, we fix a = 1, b = 5 and c = 0.2. In our 
experience, this heuristic procedure works well for both tumor and normal tissue CNV data. 
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3.4 Calling Procedure 

Even if this is not the focus of our proposal, in order to compare the performance of our segmen- 
tation algorithm with HMM approaches, it becomes necessary to distinguish acquisitions from 
losses of copy number. While the same segmentation algorithm can be applied to a wide range of 
data sets, calling procedures depend more closely on the specific technology used to carry out the 
experiments. Since our data analysis relies on Illumina genotyping arrays, we limit ourselves to 
this platform, and briefly describe the calling procedure we adopt in Section |4j 

Analyzing one subject at the time, each segment with constant mean is assigned to one of five 
possible copy number states (c = 0, 1, 2, 3, 4). Let R collect the indexes of all SNPs comprising 
one segment and let (x#, y^) = {(xj, yj),j E R} be the vectors of values for BAF and LRR in 
the segment. On the basis of typical pattern for BAF and LRR in the different copy number states 
(see [Hl|45l|471), we can write log-likelihood ratio 

LR(c)=log- ^v + logT -, rr, c = 0,l,3,4, (8) 

-kBAFlXi?; Z) J-'LRRyy Ri ^) 

explicitly defined in the supplementary material. Segment R is assigned a CNV state c that maxi- 
mize LR(c), only if LR(c) > r 1 , where r\ is a pre-specified cut-off. 

As noted in [|53l . the LRR data for a segment with c = 2, ideally normalized to have mean 
0, often has a small non-zero mean, due to experimental artifacts. If the number of SNPs in R 
is sufficiently large, a log-likelihood-ratio criterion as the above would result in the erroneous 
identification of a copy number different from 2. To avoid this, we also require that the size of the 
absolute difference of the mean of LRR from zero be larger than a threshold \yji\ > r 2 a. 

4 Results 

We report the results of the analysis of two simulated and two real data sets, which overall ex- 
emplify the variety of situations where joint segmentation of multiple sequences is attractive, as 
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described in the introduction. In all cases, we compare the performance of the proposed proce- 
dure with a set of relevant, often specialized, algorithms. The penalized estimation method we 
put forward in this manuscript shows competitive performance in all cases and often a substantial 
computational advantage. Its versatility and speed make it a very convenient tool for initial ex- 
ploration. To calibrate the run times reported in what follows, it is relevant to know that all our 
analyses were run on a Mac OS X (10.6.7) machine with 2.93 GHz Intel Core 2 Duo and 4 GB 
1067 MHz DDR3 memory. 

4.1 Simulated CNV in normal samples 

We consider one of the simulated data sets described in [|54l : relatively short deletion and dupli- 
cation (300 comprising 5,10, 20, 30, 40, 50 SNPs each) are inserted in the middle of 13000 SNPs 
long sequences, using a combination of male and female X chromosome data from Illumina Hu- 
manHap550 array, appropriately pre-processed to avoid biases (these steps included a scrambling 
of SNP positions, so to avoid long-range signal fluctuation). This setting mimics the small rare 
CNVs possibly occurring in the genome of normal individuals: in our main analysis, therefore, 
we process one individual at the time, reflecting the typical level of information available to sci- 
entists in these contexts. HMM methods, like PennCNV, are expected to be the most effective 
in this problem; segmentation methods like CBS are closer to our own and therefore also make 
an interesting comparison. As repeatedly discussed, Illumina platform produces two signals for 
one subject: LRR and BAR A segmentation method that can process one signal at the time would 
give its best results using LRR, which carries most of the information. Given this background, 
we compare four methods: PennCNV, CBS on LRR, fused lasso on LRR only, and group fused 
lasso on LRR and mBAF. The implementations we use are those reflected in the software pack- 
ages: PennCNV (version 2010May01), R package DNAcopy for CBS (version 1.24.0) flU and 
our own R package Piet (version 0.1.0). Tuning parameters for PennCNV and CBS are set at the 
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default values; the fused lasso implementation corresponds to Ai = 0.1, A 2 = 2 x ^13000, and 
A 3 = and the group fused lasso to Ai = 0.1, A 2 = 0, and A 3 = 2 x ^13000. To call deletion 
and duplication with CBS and the two fused-lasso approaches, we use both LRR and BAF data 
(before transformed to mBAF) with the following cut-off values: ri = 10 and r 2 = 1(1.5) for 
duplication (deletion). Performance is evaluated by the same indexes we used in ll54l : true positive 
rate (TPR or sensitivity) and false discovery rate (FDR), all defined on a per SNP basis. Results 
are summarized in Table [TJ 

Not surprisingly, all algorithms perform similarly well for larger deletions/duplications and 
it is mainly for variants that involve < 10 SNPs that differences are visible. Algorithms that 
rely only on LRR (as CBS and fused lasso) underperform in the detection of small duplications 
(comparison is particularly easy for duplications of size 10 SNP, where the selected parameter 
values lead to similar FDRs in the three segmentation methods). The group fused lasso can almost 
entirely recover the performance of PennCNV and outperforms CBS in this context. 

For curiosity, we analyzed all sequences simultaneously. While this represents an unrealistic 
amount of prior information, it allows us to evaluate the possible gain of joint analysis: FDR 
practically become (<0.02%) for all CNV sizes, but power increases only for CNV including 
less than 10 SNPs. 

Finally, it is useful to compare running times. Summary statistics of the per sample time are 
reported in Table [TJ while all algorithms are rather fast, the two implementations of the fused lasso 
are dominating. 

4.2 A simulated tumor data set 

To explore the challenges presented by tumor data, we rely on a data set created by ll39~1h with 
the specific goal of studying the effect of contamination between normal and cancer cells. The 
HapMap sample NA06991, genotyped on Illumina HumanHap550 array, was used to simulate a 
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cancer cell line, by inserting a total of 10 structure variation regions, including one-copy losses, 
one-copy gains, and copy neutral loss-of-hetrozygosity (CN-LOH) (see Supplementary Table [2]). 
The signal from this artificial "tumor" sample was then contaminated in silico with that of the 
original "normal" sample, resulting in 21 data sets, with a percentage of normal cells ranging from 
0% to 100%. Note that most simulated CNV or CN-LOH regions are very large — some spanning 
an entire chromosome — and the challenge in detection is really due to the contamination levels. 

For ease of comparison, we evaluate the accuracy of calling procedures as in the original 
reference 11391 : sensitivity is measured for each variant region as the percentage of heterozygous 
SNPs that are assigned the correct copy number; and specificity is the percentage of originally het- 
erozygous SNPs in unperturbed regions that are assigned CN=2. We compare the performance of 
GFL to BAFsegmentation 11391 and PSCN [8| representing, respectively, a version of segmentation 
and HMM approaches specifically developed to deal with contaminated tumor samples (both these 
algorithms have been tested with success on this simulated data set). 

Following other analyses, we do not pre-process the data prior to CNV detection. BAFsegmen- 
tation and PSCN were run using recommended parameter values. For each of the diluted data sets, 
we applied the GFL model on each chromosome at one time using both LRR and mBAF, whose 
standard deviations are normalized to 1. Tuning constants are set to Ai = 0, A2 = 0.5 x 3 x y/\og N, 
and A3 = 0.5 x 3 x y/\og N, varying specifically for chromosome interrogated by N SNPs. The 
change points resulting from hard segmentation on LRR and mBAF are combined to make a finer 
segmentation of the genome. Finally, we adopt the same calling procedure described by Il39"l . 
For ease of comparison with PSCN, only analysis of simulated tumor data are reported, even if 
BAFsegmentation and GFL would gain from using the genotype of normal cell in defining mBAF. 

Figure [T] summarizes the sensitivity of each method, as a function of percentage of normal cell 
in the sample. Sensitivity is calculated for each of the 10 regions separately. All three methods 
work reasonably well under a wide range of percentages of normal cell contamination (in 5 out 
of the 10 regions, GFL appears to lead to best results, while in the other 5 PSCN does). The 
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CNV region that comprises the smallest amount of SNP is the hemizygous loss on Chromosome 
13: in this case GFL in our hands behaved in the most stable manner. GLF outperforms the two 
comparison methods in terms of specificity (Figure [2]): while the specificity values might appear 
very high in any case, this is somewhat of an artifact due to the adopted definition of this index. It 
is relevant to note that the performance of PSCN in our hands does not correspond to the published 
one [8]. While we tried our best to set the parameter values, we have not succeeded in replicating 
the authors' original results, which should be considered in the interest of fairness. 

PSCN, like GFL, is implemented in R with some computationally intensive subroutines coded 
in C. BAFsegmentation relies its segmentation part on the R package DNAcopy, whose core algo- 
rithms are implemented in C and Fortran, and it is wrapped in Perl. A comparison of run times 
indicate that GLF and BAFsegmentation are comparable, while PSCN is fifty times slower than 
GFL (see Supplementary Table [3]). 

4.3 One sample assayed with multiple replicates and multiple platforms 

We use the data from a study [28] assessing the performance of different array platforms and CNV 
calling methods to illustrate the advantages of joint analysis of multiple measurements on the same 
subject. DNA from four individuals was analyzed in triplicate on each of 5 platforms: Affymetrix 
6.0, Illumina 1M, 660W, Omnil-Quad (OIQ) and Omni2.5-Quad (02Q) (among others Il28ln . 
We use the results on the first three to define "true" copy numbers and try to reconstruct them 
using data from OIQ and 02Q. The nine "reference" experiments were analyzed with 4 or 5 CNV 
calling algorithms (see [128 1) and a CNV was identified using majority votes: consistent evidence 
was required from at least 2 analysis tools, on at least 2 platforms, and in at least 2 replicates (see 
Supplementary Table [4]). Here CNVs detected in two replicates/algorithms/platforms are regarded 
as the same CNV and collapse down to one CNV with the outmost boundaries when they overlap 
with each other. 
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The test experiments are based on 1,020,596 and 2,390,395 SNPs on autosomes after some 
quality control, at a total of 2,657,077 unique loci. Since our focus here is to investigate how to 
best analyze multiple signals on the same subject, rather than on the specific properties of any CNV 
calling method, we carry out all the analyses using different settings of GFL in segmentation while 
keeping the same CNV calling and summarizing procedure. All segmentation is done on LRR 
only while calling procedure uses both LRR and BAF (with cut-off T\ = 10 and r 2 = 1). Here we 
compare three segmentation settings to analyze these 6 experiments per subject (see Supplementary 
Table [5] for more details about tuning parameters): 

1. The signals from the three technical replicates with one platform are averaged and then 
segmented and subject to calling procedure separately. The final CNV list is the union of 
CNV calls from the two platforms. 

2. The signals from the three technical replicates with one platform are each segmented and 
subject to calling procedure separately. A majority vote is used to summarize CNV result 
for each platform: a CNV needs to be called in at least two replicates out of three. The final 
CNV list is the union of the two platforms' results. 

3. The signals from the three technical replicates of both platforms (6 LRR sequences) are 
segmented jointly. Calling procedure is still done on each replicate separately, and the same 
majority vote is used to summarize CNV result for each platform. Again, the final CNV list 
is the union of the two platforms' results. 

To benchmark the result of joint analysis we use MPCBS [51 J, a segmentation method, specifically 
designed for multi-platform CNV analysis. The segments output from MPCBS are proceeded to 
the same calling, majority voting, and summarizing procedure. 

Table [2] presents the results: averaging results from different technical replicates leads to loss 
of power, while joint analysis of all the signals leads to the most effective performance. GFL joint 
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analysis leads to results comparable to those of MPCBS, but it is at least 30 times faster than the 
competing method. 

4.4 Multiple related samples assayed with the same platform 

In the context of a study of the genetic basis of bipolar disorder, the Illumina Omni2.5-Quad chip 
was used to genotype 455 individuals from 11 Columbian and 13 Costa Rican pedigrees. We use 
this data set to explore the advantages of a joint segmentation of related individuals. In absence of 
a reference evaluation of CNV status in these samples, we rely on two indirect methods to assess 
the quality of the predicted CNVs. We used the collection of CNVs observed in HapMap Phase III 
|fT9ll to compile a list of 426 copy number polymorphisms (selecting all those CNVs with frequency 
> 0.05 in pooled samples from 11 populations) and assumed that if we identify in our sample a 
CNV corresponding to one of these regions, we should consider it a true positive. For the purposes 
of this analysis we considered a detected CNV to correspond to one identified in HapMap if there 
was any overlap between the two regions. 

Another indirect measure of the quality of CNV calls derives from the amount of Mendelian 
errors encountered in the pedigrees when we consider the CNV as a segregating site. De novo 
CNVs are certainly a possibility, and in their case Mendelian errors are to be expected. However, 
when the CNV in question is a common one (already identified in HapMap), it is reasonable 
to expect that it segregate in the pedigrees as any regular polymorphism. We selected a very 
common deletion on Chromosome 8 (HapMap reports overall frequency > 0.4 in 11 populations) 
and compared different CNV calling procedures on the basis of how many Mendelian errors they 
generate. 

As mentioned before, PennCNV represents a state-of-the-art HMM method for the analysis 
of normal samples and, therefore, we included it in our comparisons. However, the parameters of 
the underlying HMM algorithm had not been tuned on the Omni2.5-Quad at the time of writing, 
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resulting in sub-standard performance. Segmentation methods are less dependent on parameter 
optimization; hence, GFL analysis of LRR and BAF one subject at a time can provide a better 
indication of the potential of single-sample methods. We considered two multiple- sample algo- 
rithms: GFL and MSSCAN [53], both applied on LRR with group defined by pedigree member- 
ships. (While a trio-mode is available in PennCNV ll46l . this does not adapt to the structure of our 
families.) A final qualification is in order. While the authors of MSSCAN kindly shared with us 
a beta-version of their software, we find it not to be robust. Indeed, we were unable to use it to 
segment the entire genome. However, we successfully used it to segment Chromosome 8, so that 
we could include MSSCAN in the comparison based on Mendelian error rates. 

Prior to analysis, the data was normalized using the GC-content correction implemented in 
PennCNV |[T2l . For individual analysis, the GFL parameters were Ai = 0.1, A 2 = 0, and A 3 = 
2 x \/Tog N, where N is the number of SNPs deployed on each chromosome; for pedigree analysis, 
the GFL parameters were Ai = 0. 1, A 2 = 0.5 x 2 x y'log N, and A 3 = 0.5 x 2 x V0.3M x yTog N, 
where M is the number of individuals in each pedigree. For MSSCAN, CNV size is constraint to be 
less than 200 SNPs and the maximum number of change points is set as 50. The calling procedure 
with T\ = 10 and r 2 = 1 was applied to both the GFL and MSSCAN results. 

Table [3] summarized the total number of copy number polymorphisms (CNPs) identified in 
our sample by different approaches and their overlap with known CNPs from HapMap. For the 
purpose of this comparison we considered as a CNP a variant with frequency at least 10% in our 
sample. All analysis modes of GFL agree more with HapMap list than PennCNV in the sense 
of percentage of overlap. It is also clear that GFL-pedigree analysis achieves larger overlap with 
HapMap data than GFL-individual analysis. The time cost per sample for pedigree is reasonable 
and scales well with the increment of sample size. 

Table [4] summarizes the results of our investigation of a 154kb CNP region on Chromosome 8p 
(from 39,351,896 to 39,506,122 on NCBI Build 36 coordinate). All methods but PennCNV show 
detected deletions only; this coincides with the observation from HapMap data. We used option 
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Mistyping of Mendel (version 1 1.0) Il22"ll38l to detect Mendelian errors. Joint segmentation meth- 
ods discover more hemizygous deletions than individual analysis, resulting in fewer Mendelian 
errors. MSSCAN discovers the largest number of hemizygous deletions. Figure [3] shows an exam- 
ple of large pedigree, where 3 out of 4 Mendelian errors are removed by joint analysis. 
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Table 1: Detection accuracy (as percentage of SNPs) and computation times for PennCNV, CBS, 
Fused Lasso and Group Fused Lasso on a simulated set of CNVs in normal samples. Overall ac- 
curacy are calculated pooling all sequences with a given type of CNVs. The average (and standard 
deviation) of the number of seconds required for the analysis of one sequence is reported. 



CNV 


CNV 


PennCNV 


CBS 


Fused Lasso 


Group Fused Lasso 


Size 


Type 


TPR 


FDR 


TPR 


FDR 


TPR 


FDR 


TPR 


FDR 


5 


Deletion 


83.80 


4.92 


78.20 


0.68 


63.93 


1.74 


64.27 


1.83 




Duplication 


58.53 


4.67 


11.67 


10.26 


20.00 


37.76 


39.87 


14.33 


10 


Deletion 


95.03 


1.45 


88.37 


0.56 


88.50 


0.60 


88.87 


0.56 




Duplication 


93.43 


0.78 


56.50 


4.40 


83.90 


12.60 


91.60 


3.85 


20 


Deletion 


94.63 


0.58 


90.50 


0.39 


90.80 


0.47 


90.83 


0.47 




Duplication 


96.13 


0.92 


86.22 


3.58 


92.77 


4.95 


94.98 


2.13 


30 


Deletion 


94.57 


0.28 


93.30 


0.29 


89.38 


0.52 


89.77 


0.53 




Duplication 


96.09 


0.05 


90.77 


1.61 


94.32 


1.78 


94.98 


1.29 


40 


Deletion 


97.83 


0.59 


97.58 


0.09 


97.28 


0.19 


97.28 


0.19 




Duplication 


94.61 


0.46 


92.77 


0.98 


93.94 


1.15 


94.63 


0.75 


50 


Deletion 


94.33 


0.07 


92.76 


0.04 


90.47 


0.11 


90.48 


0.11 




Duplication 


94.50 


0.09 


93.81 


0.74 


93.11 


0.79 


93.64 


0.49 


Overall Deletion 


95.02 


0.55 


93.06 


0.19 


91.08 


0.33 


91.19 


0.34 


Overall Duplication 


93.82 


0.44 


86.92 


1.55 


90.56 


2.85 


92.46 


1.38 




Overall 


94.42 


0.49 


89.99 


0.85 


90.82 


1.60 


91.83 


0.87 


Time (sec.) 


0.48 (0.01) 


0.78 (0.69) 


0.22 (0.13) 


0.28 (0.05) 
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Figure 1: Sensitivity as function of percentage contamination by normal cells in the 10 different 
simulated CNV regions. Sensitivity is not defined at 100% contamination. 
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Figure 2: Specificity as function of percentage contamination by normal cells. Note that IH reports 
better perfomance of PSCN in correspondence of contamination levels 85% , 95% and 100%. 



Table 2: Number of CNVs detected (Det.) and overlapping (Ovlp.) with reference results as well 



as average computation time for four samples under different analyses. 





NA15510 


NA18517 


NA18576 


NA 18980 




Analysis 


#Det. 


# Ovlp. 


#Det. 


# Ovlp. 


#Det. 


#Ovlp 


#Det. 


#Ovlp 


Time (min.) 


Analysis 1 


170 


38 


144 


34 


160 


25 


145 


22 


1.2 


Analysis 2 


102 


36 


109 


33 


93 


25 


91 


20 


3.7 


Analysis 3 


80 


38 


82 


32 


69 


25 


56 


15 


8.5 


MPCBS 


98 


34 


88 


28 


59 


18 


68 


21 


313.9 
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Table 3: The number of detected CNP regions with frequency > 0.1 in our sample by different 
methods and their overlap with a list of CNP regions compiled from HapMap data. Computation 
time (in minute) is per sample. 



Method 


# detected CNVR 


# Overlap 


% Overlap 


Time (min.) 


PennCNV 


189 


63 


33.33% 


3.44 


GFL-Individual (LRR+BAF) 


95 


50 


52.63% 


3.90 


GFL-Pedigree (LRR) 


106 


62 


58.49% 


1.57 



Table 4: Detected copy numbers in a common deletion on Chromosome 8. Across the various 
algorithms, subjects are assigned to one of 4 types of copy number: for each algorithm, we report 
the total numbers of CN^ 2 identified; the total number of "core" families with Mendelian errors; 
and the average computation time (in minute) per sample for the analysis of Chromosome 8. 



Method 


#CN=0 


#CN=1 


#CN=3 


# families with Mendelian errors 


Time (min.) 


PennCNV 


125 


39 


102 


35 


0.19 


GFL-Individual 


123 


97 





20 


0.21 


GFL-Pedigree 


123 


137 





15 


0.09 


MSSCAN-Pedigree 


123 


154 





15 


0.11 
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5 Discussion 



We have presented a segmentation method based on penalized estimation and capable of processing 
multiple signals jointly. We have shown how this leads to improvements in the analysis of normal 
samples (where segmentation can be applied to both total intensity and allelic proportion), tumor 
sample (where we are able to deal with contamination effectively), measurements from multiple 
platforms, and related individuals. Given that copy number detection is such an active area of 
research, it is impossible to compare one method to all the others available. However, for each of 
the situations we analyzed, we tried to select approaches that represented the most successful state- 
of-the-art. In comparison to these, the algorithm we presented performs well: its accuracy is always 
comparable to that of the most effective competitor and its computation time often more contained. 
We believe that for its versatility and speed, GFL is particularly useful for initial screening. 

There are of course many aspects of CNV detection that we have not analyzed in this paper: 
from normalization and signal transformation to FDR control of detected CNVs. There are also 
a number of improvements to our approach that appear promising, but at this stage are left for 
further work: for example, it is easy to modify algorithms so that the penalization parameters are 
location dependent to incorporate prior information on known copy number polymorphisms; more 
challenging is developing theory and method to select the values of these regularization parameters 
in a data-adaptive fashion. 

Finally, while our scientific motivation has been the study of copy number variations, the joint 
segmentation algorithm we present is not restricted to specific characteristics of these data types, 
and we expect it will be applied in other contexts. 

Software implementation 

All the code used to run the analysis presented in this paper is available at the web-page of the 
authors. We have implemented the segmentation routine, which is our core contribution, in an R 
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package (Piet) to be submitted to R-forge (http://r-forge.r-project.org). To demonstrate a visual- 
ization of the CNV results on Chromosome 8 in the bipolar disorder study (see Section 4.4), we 
refer the interested audience to Supplementary Figure [2]in the supplementary material. 
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TDM algorithm 



The non-zero entries in Aj and bj in the re-shaped surrogate function ([5]) are listed as follows: 
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& ! m) (j) = J/y, j = 1, . . . , n. 

When staking measurements at different positions, the item 1 in al m \j } j) is replaced by 5ij 

and b\ mS> = y^ is replaced by b\' = S^y^. 

Bias estimation 

Let Xij be the data for sequence i at locus j after Oi of each sequence is normalized to 1. With such 
normalization, the model ([3]) is reduced to a simpler form with global tuning parameters to each 
sequence for easier interpretation: 
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(S.l) 



The solution to minimize f(/3) is unique for f(f3) is strictly convex. Denote the solution as 
f3 = 0ij)MxN- Suppose sequence i is partitioned into Kj consecutive segments {fl[ , . . . , Rg }, 



delimited with change points Ji = {j[ , . . . , } C {2, . . . , N} (left end of segment 2, . . . , Ki). 
The fitted means of each segment is denoted as fi^ = (/t^ , . . . , /t^ ), i.e., = /tjj. , if j G ^ . 
The length (number of SNPs) of each segment is L k - 
mean vector for sequence % can be written as 



:■(<) 



R^l, k = 1 Ki. Thus, the estimated 



A', 



aw- 



fc=l 



/3 is the optimal solution if and only if it satisfies the subgradient condition df(p) = 0; that 



is, 



Pij = Vij - Ais 



(i) 



(3) 



(S.2) 



where sj^, and are coordinates of subgradient corresponding to fy's appearing in each of 
the three penalty terms. Both bias estimation and asymptotic analysis rely on the analytic form of 
subgradient. Now we discussed the bias induced by each penalty separately. 



.(3) 



Bias induced by lasso penalty 



It is easy to verify that the subgradient for the lasso penalty can be written as 

s\f = sign(/%), 



where, with a bit abuse of notation, 



sign (x J 



1, 



if x > 0, 



if x < 0, 



(S.3) 



z G [-1,1], if a; = 0. 

Hence, the lasso penalty term merely plays as a soft-thresholding on the fitted values resulted from 



the model (S.l ) with Ai = 0, denoted as $ij(0, A2, A3); that is, for any Ai > 0, 



Aj(Ai, A2, A3) 



sign 



4(0,A 2 ,A 3 ) 4(0, A 2 , A 3 ) -Ax 



where (x) + = max{x, 0}. This is also highlighted in Lamma A.l of |fT5l for model (S.l ) with 



Bias induced by fused-lasso penalty 



In model (S.l ) with Ai = and A3 = (only fused-lasso penalty involved), Lemma 2.1 in OT 
gives an insightful characterization of 
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0. 

The result implies that the sample mean (as an unbiased estimate of true mean) of a local mini- 
mum/maximum segment (except it is located at either end) is shifted towards due to fused-lasso 
penalty. The bias is positively proportional to A2 and negatively proportional to the length of the 
segment. It is more important to notice that there exists no configuration where a local mini- 
mum/maximum segment has a jump size (relative to neighboring segments) less than the amount 
of bias. It means that a CNV with small jump size or small length could possibly be merged into 
neighboring segments, if A 2 is set too large. 



Bias induced by group-fused-lasso penalty 

The subgradient for group-fused-lasso penalty is given in the following Proposition 1. 
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Proposition 1: The /^/s involved in group-fused-lasso penalty have subgradient given by 



.(3) 



-e». 2 , if i = 1 , 

eij - eij+i, if 1 < j < N, 
e iN , if j = N, 

for % = 1, . . . , M, where e, = (e^, . . . , eujf for j = 2, . . . , M are given by 
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> o, 
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(S.4) 



(S.5) 



Proof: The proof follows a similar technique used in the proof of Lamma A.l in PH . Let 
T = [— Ifcr, Im]> where 1m is M x M identity matrix. Then, for any 2 < j < N, 

For the j such that | — I Vi > 0' tne sub-gradient is reduced to regular gradient, and thus 

can be derived in a usual way. We now focus on the j such that \\/3u) — Pfj-i)\\h = 0' ^ e 
subgradient of /?y at 0. By Cauchy-Schwartz inequality, we have 

- \\ r ^[P'{j-l)^'ij)} T \\i2\\ e j\\i2 

> <T[(3l_ 1): (3lf,e j > 

= MO)+<[/3j_ 1) ,/3j. ) ] r -0,T^> 

where e, is any vector such that ||ej||f 2 < 1. It follows by the definition of subgradient that 
T T ej = [-ej, eJ] T is the subgradient for {/3f j _ 1) ,/3j j) } T . □ 

The bias induced by the group-fused-lasso penalty can be derived from the analytic form of 
subgradient accordingly and is given in the following Proposition 2. 
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Proposition 2: In model (S.l) with A x = and A 2 = 0, the fitted means of segments for 
sequence i can be expressed as 



where 



and 



-(»') 



77 £ 



r,(0 



i(0 



_Aa_ . r .f9- (i) l 
f(i) '»Ul )i 
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Proof: The proof follows a similar technique used in the proof of Lemma 2.1 in [31]. Follow 



ing the subgradient condition (S.2) in case Ai = and A2 = 0, we have 



A3 



E 



(3) 
5- ■ . 



By Proposition 1 and simple algebra, we have 



E 



(3) 
?• ■ 



— e. -.(;), 



if Jfe = 1, 



e, •« -e , w , if 2 < A; < i^i — 1, 



i.r. 

J Ki-l 



if k = Ki. 



Note that at jump points, subgradient has explicit form as shown in Proposition 1. It follows that 
e. -.(<) = Ti(j^), for = 1, . . . , Ki — 1, where rj(-) is defined in Proposition 2. □ 



Some interesting implications follow immediately. For sequence i, consider one of its fitted 
segment k with end points — 1]. If no other sequences share change points at these two 



(i) 

ends, then the bias term c k reduces to what it appears in model <|S . 1 J) with fused-lasso term only 



(Ai = and A3 = 0). If m out of M sequences share change points at these two ends and also 
assume the jump size at these two locations for all the m sequences are roughly the same, then the 
absolute value of the bias term can be approximately written as ^ ■ -4=. It means that if more 
than one sequences share change points at the same coordinate, then they can benefit from each 
other to reduce their individual bias, relative to the bias induced by fused-lasso penalty specific to 
each individual sequence. 

Asymptotic behavior 

Now we try to give a justification of the order of the magnitude of A 2 and A 3 in compatible with 
their large sample behavior, say, as N — > 00. When the number of sequences M in segmentation 
task is relatively large, extra caution is needed for A3. Again, we discuss asymptotic behavior of 
the solution influenced by fused-lasso and group-fused-lasso separately for easier exhibition. 

Asymptotic behavior for fused-lasso penalty 

In fused-lasso model (Ai = and A3 = 0), the justification is directly inspired by the proof of 
Theorem 2.3 in [1311 . Denote the event 

Si = {Ji = Ji} H {sign(/% - Aj-i) = sign(/3y - Aj-i), Vj e Ji}, 

for i = 1, . . . , M respectively. This event means that all jump points and the direction of jumps 
are correctly identified for each sequence i. A necessary condition required for A2 is summarized 
in Proposition 3. 

Proposition 3: It is required that A2 = 0(y/\og N) to ensure limAr^oo P(£») = 1 for i = 
1, . . . , M, at the linear rate. 

This asymptotic behavior follows directly the proof of Theorem 2.3 in PTfl . We have some 
quick remarks: 



1) If the signal of each sequence is not normalized, then A 2i = c 2 <7j\/log N, specific to se- 
quence i. 

2) In order to ascertain a CNV segment with length L and jump size 5, the bias needs to satisfy 
2^ = 2c 2 a t ^N < ^ i e ^ C2 < . ^ L Rere £ can be interpreted as signal-to- 
noise ratio (SNR). For a specific platform, one may get a sense of the magnitude of SNR 
and L from prior knowledge. In practice, it is desired to take as large value of c 2 as possible 
to ensure the sparsity of the segmentation, but not too large in order to compensate for the 
constraint of signal strength (^rL). Based on our experiences of analysis of Illumina data 
ll54ll . the results are not sensitive to the choice of c 2 , provided that it falls into a reasonable 
range. 

Asymptotic behavior for group-fused-lasso penalty 

In group-fused-lasso model (Ai = and A 2 = 0), we have similar requirement of A3 as for A 2 , 
which is given in Proposition 4. 

Proposition 4: It is required that A 3 = 0(y/M y/log N) to ensure lirnAr-Kx> P(n^ 1 £ i ) = 1, at 
the linear rate. 



Proof: For simplicity, we prove under the condition that are i.i.d. M(0, 1) (after Oi is 
normalized to 1), while this condition can be relaxed [31 J. We also follow the same technique used 
in the proof of Theorem 2.3 in [13T|| . Let dij = /3y — d%j — fiij ~ an d d\- = — 



Also denote = (d\p . . . , d e Mj ) T and J = Ufi^. By the subgradient condition (S.2), for each 
i, Si holds if and only if 

d^j = A 3 (2e^ - eij_i - e iJ+1 ), for j e J£, (S.6) 

and 

l4l>0, for jeJi. (S.7) 



Condition (IS .71) has direct relevance to the bias issue, as discussed above. Now we focus on 



condition (S.6), which implies that 



max \ \dj\ 



max As I |2e,- 



e j"+lIUa < 4^3- 

maxjgj-c | |d 



2111 > 8A§) -> as JV -> 



It is left to show that P(maxj ei 7c | |d*- 1 > 4A 3 , m , > , T ,. , , , 
oo for % = 1, . . . , M. Note that for each j, d\ . . . , c?mj are i.i.d. jV(0, 2), so | \d e -/ y/2\ \ l2 
Then we have 



2 ~ Xm- 



(max lid' 
(U, e ^||d} 



2||| 2 > 8A|) 



2llf 2 > SAD 



< E 

J£J C 



>8A 



\j c \nm/^\\i>^\ 



< exp 



-_(8A£-Jlf)+log|,T|- } 



M M 



8A 



Here the first inequality is due to union bound and the second inequality is due to Chernoff 's bound 
for Xm distribution. Under the assumption on sparsity of the change points, we have | J c \ = O(N) 
for fixed M. In our settings, M is fixed (which may rise up to thousands) while N — > oo, yet in 
practice, M is not negligible with respect to \/\ogN. For example, v/log(10 6 ) ~ 3.72, and it is 
not uncommon to have more than 4 sequences for joint segmentation. Therefore, it is necessary to 



have A 3 = 0(VM^/hg~N). □ 



We also have some remarks on how to determine A 3 : 

1) If the signal of each sequence is not normalized, then \ 3>i = c^Ui^/pM \/\og N. The choice 
of p is decided case by case and discussed in the main text. 

2) Following the above discussion about bias induced by group-fused-lasso penalty, if m out 
of M sequences carry CNVs with exactly the same boundary, the bias can be approximately 
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written as C3<J i( i) ° g • On one hand, if p is over estimated so that pM is much larger 

than m, the model would be over penalized and introduce more bias than that is attributed 
to individual fused-lasso penalty, and thus does not benefit from joint analysis; On the other 
hand, if pM is set too small, we have insufficient control on the sparsity of each sequence, 
so that it has to be compensated by the fused-lasso penalty. This is the reason why we need 
to incooperate p(p) to re-weight the relative influence of the two penalties. 

Details in calling procedure 

We specify the likelihood functions of LRR and BAF signals in the log-likelihood ratio ([8]) as 
follows. For BAF signal, the likelihood is usually modeled for different copy number states as a 
mixture of densities surrounding a few possible BAF values corresponding to different genotypes 
|9l @7]|. When population frequencies for allele A and B, p A and p B , are available or can be 
estimated from data, we have 

LbafOe; c ) = ^2( C )p c a s Pb<Ps{x; p s , o*), for c = 0, 1, 2, 3, 4, 
s =o w 

where S (-; p, s , a 2 s ) is normal density for state s. The details in model and parameter specification 
are listed in Supplementary Table [T] 

In case where population frequencies p A and p B are not available, we might use an alternative 
likelihood function for BAF [54], defined by 

L BAF (x; c) = max 4> s (x; jj s , a 2 ), for c = 0, 1, 2, 3, 4, 
se{o,...,c} 

where all parameters are defined in the same way (see Supplementary Table [T]). 

For LRR signal, the likelihood function is simply defined by normal density: 

£lrr(?/; c) = (f)(y;p c ,al). 
10 



c 


s 


Genotype 


M-) 


Us 










Null 


normal 


1/2 


10a x 


1 


0,1 


A, B 


half normal 


0,1 




2 


0,2 


AA, BB 


half normal 


0,1 


o x 




1 


AB 


normal 


1/2 




3 


0,3 


AAA, BBB 


half normal 


0,1 






1,2 


AAB, ABB 


normal 


1/3, 2/3 


d~x 


4 


0,4 


AAAA, BBBB 


half normal 


0,1 


O'x 




1,2,3 


A AAB, AABB, ABBB 


normal 


1/4, 1/2, 3/4 


&X 



Supplementary Table 1 : Model and parameter specification in BAF signal for each copy number 
state. a x is empirically estimated from BAF values in (0.4, 0.6) for each individual. 

For c = 0, 1, 3, 4, \x c and cr? are estimated based on the data y R in segment R being considered, 
while fjL 2 and a\ are estimated from the data of the whole chromosome on which segment R locates 
or, locally, from the data of a few hundred markers flanking the segment. 

Additional Results 



ll 



Region 


Aberration Type 


Chr 


bp Start 


bp End 


#SNP 


#hetSNP 


1 


CN-LOH 


5 


1 


47700000 


9397 


2756 


2 


Loss 


5 


111789971 


112521346 


156 


79 


3 


Gain 


8 


1 


45200000 


12564 


3830 


4 


Gain 


8 


128432670 


129207869 


218 


91 


5 


Loss 


9 


1 


50600000 


11201 


3889 


6 


Loss 


10 


84504379 


94825178 


1988 


648 


7 


Gain 


12 


1 


132449811 


27131 


8818 


8 


Loss 


13 


31766569 


31892852 


37 


10 


9 


CN-LOH 


17 


7431864 


11747138 


1150 


308 


10 


CN-LOH 


17 


22300000 


78774742 


9713 


3205 


Total number of modified heterozygous SNPs 
Total number of heterozygous SNPs on autosome 
Total number of SNPs on autosome 




23634 
176207 
547359 



Supplementary Table 2: Regions of allelic imbalance imposed to the HapMap sample NA06991 



Method Time per sample in sec. (mean (std dev)) 
GFL 21.97(1.31) 
BAFsegmentation 41.73 (-) 

PSCN 1154.18 (74.73) 
Supplementary Table 3: Speed comparison of three methods: GFL, BAFsegmentation and PSCN. 
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Sample 


Gender 


Ancestry 


Resource 


Tvce 


<10k 


10-50k 


50- 100k 


>100k 


Total 










loss 


12 


25 


3 


7 


47 


NA15510 


Female 


European 


PDR 


pain 


o 


o 


1 


4 


5 










total 


12 


25 


4 


11 


52 










loss 


10 


22 


4 


4 


40 


NA18517 


Female 


YRI 


HapMap 


gain 


1 


3 


1 


8 


13 










total 


11 


25 


5 


12 


53 










loss 


13 


16 


4 


5 


38 


NA 18576 


Female 


CHB 


HapMap 


gain 





2 


2 


4 


8 










total 


13 


18 


6 


9 


46 










loss 


8 


16 


1 


4 


29 


NA18980 


Female 


JPT 


HapMap 


gain 








1 


3 


4 










total 


8 


16 


2 


7 


33 



Supplementary Table 4: Sample information and reference CNV regions summarized for each 
sample by their types and sizes. The ancestry of NA15510 was not recorded but inferred in 11201 . 
Abbreviation: PDR = Polymorphism Discovery Resource. 



13 







NA15510 


NA18517 


NA18576 


NA 18980 




Analysis 


p M 


# Det. # Ovlp. 


# Det. # Ovlp. 


#Det. 


#Ovlp 


#Det. 


#Ovlp 


Time (min.) 


Analysis A: GFL done on averaged signal for each platform 


OIQ 


1 1 


92 34 


73 22 


71 


21 


69 


20 


0.3 


02Q 


1 1 


114 22 


92 24 


111 


15 


95 


11 


0.9 


Union 




170 38 


144 34 


160 


25 


145 


22 


1.2 


Analysis B: GFL done on averaged signal of both platforms jointly 




2 


128 40 


108 33 


96 


21 


104 


23 


4.2 


Analysis C: GFL done on three replicates separately for each platform 


OIQ 


1 1 


66 31 


65 22 


43 


19 


48 


15 


0.9 


02Q 


1 1 


68 23 


65 22 


65 


12 


59 


13 


2.8 


Union 




102 36 


109 33 


93 


25 


91 


20 


3.7 


Analysis D: GFL done on three replicates jointly for each platform 


OIQ 


3 


64 32 


66 22 


54 


21 


53 


18 


1.1 


02Q 


3 


75 22 


70 24 


65 


11 


49 


12 


3.1 


Union 




106 36 


115 33 


96 


22 


83 


21 


4.2 


Analysis E: GFL done on three replicates of both platforms jointly 




6 


80 38 


82 32 


69 


25 


56 


15 


8.5 


MPCBS: Segmentation done on three replicates of both platforms jointly 






98 34 


88 28 


59 


18 


68 


21 


313.9 



Supplementary Table 5: Number of CNVs detected (Det.) and overlapping (Ovlp.) with reference 
results as well as average computation time for four samples under different analyses. Tuning 
parameters used in segmentation: c\ = 0.1, c 2 = 2, c 3 = 2 and p = 1; p and M are specified for 
each analysis. Analysis A, C and E correspond to Analysis 1, 2 and 3 respectively in Table [2] of 
main text. 
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I I I I I I I I 

111.5 112.0 112.5 113.0 111.5 112.0 112.5 113.0 

Position (Mb) Position (Mb) 



(a) Individual analysis (b) Joint analysis 




5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 

Normal cell contemination (%) 



Supplementary Figure 1 : Comparison of fitted profiles between analysis for each tumor sample 
with different normal cell contamination levels and joint analysis for all 21 tumor samples. Shown 
is a hemizygous loss on Chromosome 5q22. In each of the subplots, the upper panel shows the 
fitted profiles on LRR for each sample distinctly marked by a spectrum of colors , while the lower 
panel shows their corresponding fitted profiles on mBAF. Shown are data points for heterozygous 
makers, (a) Individual analysis; (b) Joint analysis. 
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Chromosome 8 



CROis |« 
;;n::i4.3. 
CFSHS(7 



CG27HS) 
C025 (13) 
C023(U) 

coiB(ao) 

C015<11) 
C014 (30) 
C013(1S) 

com (18) 




it 



i — i — i — i — i 

5 10 20 



5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 

Position (Mb) 




Supplementary Figure 2: Visualization of pedigree-wise CNV analysis results of Chromosome 8 
data in bipolar disorder study. In the main body of the plot, CNVs estimated for each individual 
are marked by small segments with color code: CN=0 in blue, CN=1 in light blue, CN=3 in 
red and CN=4 in brown. Each subject is a row, each SNP a column. Subjects belonging to the 
same pedigree are stacked together. The pedigree names are indicated on the left hand side with 
the number of pedigree members included in parentheses. On the right hand side, the barplot 
represents the number of CNV detected per subject. Two shades of green are switched alternately 
to indicate the pedigree to which the subject belongs. At the bottom, the gray histogram shows 
the GC content along the chromosome; coordinated with the representation of CNVs in the main 
body, the green histogram counts the frequency }$f CNV among the subjects represented. Vertical 
dotted line marks the centromere. 



